Fix fork-time deadlock between ConfigLoader and threadSafeForkPrepare#1375

Open
robieta wants to merge 1 commit into pytorch:main from robieta:export-D102071920

robieta commented Apr 23, 2026

Summary:
Fix a deadlock when torch._thread_safe_fork (used by torch.compile's
SubprocPool) triggers threadSafeForkPrepare() while kineto's ConfigLoader
background thread is active.

The deadlock occurs because:

  1. threadSafeForkPrepare() acquires exclusive guardLock, then calls
    SingletonVault::destroyInstances()
  2. The kShutdownConfigLoaderOnSingletonVaultDestroy callback fires and calls
    ConfigLoader::stopThread(), which tries to join() the ConfigLoader thread
  3. The ConfigLoader thread is blocked trying to acquire a shared
    ThreadSafeForkGuard in DynoConfigLoader::readBaseConfig() — but cannot
    because guardLock is held exclusively by the forking thread

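This is a classic AB-BA-style deadlock: Thread A (the forking thread) holds
guardLock exclusively and blocks in join() on Thread B; Thread B (the
ConfigLoader thread) blocks waiting for a shared guardLock, so it can never
exit and the join never returns.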
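
The blocked half of that cycle can be reproduced in isolation. The following is a minimal sketch, not kineto code: `workerCanAcquireSharedDuringFork` is a hypothetical name, a plain `std::shared_timed_mutex` stands in for `guardLock`, and the worker uses a 50 ms timed acquire only so the demo terminates (the real thread would presumably block indefinitely, which is what makes the `join()` hang).

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <shared_mutex>

// Hypothetical demo: returns true if a second thread can take guardLock in
// shared mode while the calling ("forking") thread holds it exclusively.
bool workerCanAcquireSharedDuringFork() {
  std::shared_timed_mutex guardLock;  // stand-in for the real guardLock

  // Step 1: the forking thread takes the lock exclusively.
  std::unique_lock<std::shared_timed_mutex> exclusive(guardLock);
  assert(exclusive.owns_lock());

  // Step 3: the worker tries to take the lock in shared mode. A timed
  // acquire is used here so the demo finishes instead of hanging.
  auto worker = std::async(std::launch::async, [&guardLock] {
    std::shared_lock<std::shared_timed_mutex> shared(
        guardLock, std::chrono::milliseconds(50));
    return shared.owns_lock();
  });

  // Step 2: waiting on the worker while still holding the exclusive lock.
  return worker.get();
}
```

With an untimed shared acquire in the worker, the final `worker.get()` would never return, which mirrors the `join()` in step 2.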

The fix splits stopThread() into a non-blocking signalStop() (sets
stopFlag_ + notifies condvar) and the existing stopThread() (signal + join).
The singleton vault callback now uses signalStop(). The ConfigLoader thread
exits cooperatively on its next loop iteration after guardLock is released.
The actual join is deferred to ~ConfigLoader() at static-destruction time.
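
A minimal sketch of the signal/join split, under stated assumptions: `LoaderSketch` and `forkPrepareSketch` are hypothetical names, `std::shared_mutex` stands in for `guardLock`, and the loop body is a placeholder for the config read. It illustrates the pattern described above and is not the actual patch.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical stand-in for the ConfigLoader described above.
class LoaderSketch {
 public:
  explicit LoaderSketch(std::shared_mutex& guardLock)
      : guardLock_(guardLock), thread_([this] { run(); }) {}

  // The blocking join is deferred to destruction.
  ~LoaderSketch() { stopThread(); }

  // Non-blocking: set the flag and wake the worker, never join.
  void signalStop() {
    {
      std::lock_guard<std::mutex> lk(mutex_);
      stopFlag_ = true;
    }
    cv_.notify_all();
  }

  // Blocking: signal, then join. Unsafe while guardLock is held exclusively.
  void stopThread() {
    signalStop();
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  bool exited() const { return exited_.load(); }

 private:
  void run() {
    std::unique_lock<std::mutex> lk(mutex_);
    while (!stopFlag_) {
      lk.unlock();
      {
        // Placeholder for the config read; blocks while a fork is in progress.
        std::shared_lock<std::shared_mutex> guard(guardLock_);
      }
      lk.lock();
      cv_.wait_for(lk, std::chrono::milliseconds(10),
                   [this] { return stopFlag_; });
    }
    exited_ = true;
  }

  std::shared_mutex& guardLock_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stopFlag_ = false;  // guarded by mutex_
  std::atomic<bool> exited_{false};
  std::thread thread_;  // declared last so it starts after the other members
};

// Simulates the fork-prepare path: signal under the exclusive lock, join later.
bool forkPrepareSketch() {
  std::shared_mutex guardLock;
  LoaderSketch loader(guardLock);
  {
    std::unique_lock<std::shared_mutex> exclusive(guardLock);
    assert(exclusive.owns_lock());
    loader.signalStop();  // returns immediately; a join here could hang
  }
  loader.stopThread();  // safe now: guardLock has been released
  return loader.exited();
}
```

In this sketch, calling `stopThread()` inside the exclusive scope could hang whenever the worker is waiting on the shared lock, whereas `signalStop()` returns regardless of what the worker is doing.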

Differential Revision: D102071920

@meta-cla meta-cla Bot added the cla signed label Apr 23, 2026
meta-codesync Bot commented Apr 23, 2026

@robieta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102071920.

robieta commented Apr 23, 2026

-- Initializing git submodules...
  error: expected submodule path 'third_party/kineto' not to be a symbolic link
  CMake Error at cmake/PreBuildSteps.cmake:52 (message):
    Git submodule initialization failed.  Please run:

CPU build failure looks unrelated to my changes.
